-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a section to the docs that compares cuDF with Pandas #10796
Add a section to the docs that compares cuDF with Pandas #10796
Conversation
Note that we do not support custom data types like Pandas' | ||
`ExtensionDtype`. | ||
|
||
## Null (or "missing") values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great place to call out the subtle differences in null handling logic we have vs pandas. Most of it can be dug up from the source code here but a good summary might be something like this (I think this is all of them?)
Nulls in cuDF behave differently from pandas in several edge cases. In cuDF, the rule is that nulls always propagate, whereas in pandas they may not if the mathematical result can be inferred without knowing the missing value:
- `NA ** 0 == 1`
- `1 ** NA == 1`
- `NA | True == True`
- `True or NA == True`
- `False and NA == False`
Maybe a table or something might be better than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these cases are also described in the docs (as a cross-reference with the source code linked above):
- https://pandas.pydata.org/docs/user_guide/missing_data.html#propagation-in-arithmetic-and-comparison-operations
- https://pandas.pydata.org/docs/user_guide/missing_data.html#logical-operations
I find it a little concerning that we differ in this way because it means that cuDF cannot be consistent in its behaviors between scalars and columns. It should be specifically noted that scalar operations act like Pandas (because we use the same magic NA
singleton object), and column operations always propagate NA
.
>>> import cudf
>>> cudf.NA ** 0
1
>>> cudf.Scalar(cudf.NA, dtype=float) ** 0
Scalar(1.0, dtype=float64)
>>> cudf.Series([cudf.NA], dtype=float) ** 0
0 <NA>
dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the difference in column vs scalar behaviour is problematic. I think @brandon-b-miller has thought a lot about this, where maybe we should take this discussion offline and come back and raise a separate issue if needed.
For this PR, I'll hold off on adding any further information about null behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend thoroughly reading the discussion on pandas-dev/pandas#29997 before we relitigate any of that discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our discussions offline, I'm going to hold off on documenting the exceptional cases here. I think our priority should be to first align the behavior of nulls in all three of the following cases:
- Scalar operations involving
NA
- Column operations involving
NA
- Operations in UDFs involving
NA
We can choose to always return NA
in all three cases, or make an exception for **
in all three cases, but we must be consistent. That done, we can come back here to document the difference from Pandas - if any.
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with some small edits. Thanks @shwina!
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
Co-authored-by: Bradley Dice <[email protected]>
…pandas-comparison-docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marking my approval for now since my only concern was around the null behavior.
@gpucibot merge |
Adds a section to the docs that calls out the similarities and differences from Pandas at a high level.
This is inspired by CuPy's page documenting the differences from NumPy.